Cro36WSD: A Lexical Sample for Croatian Word Sense Disambiguation
نویسندگان
چکیده
We introduce Cro36WSD, a freely-available medium-sized lexical sample for Croatian word sense disambiguation (WSD). Cro36WSD comprises 36 words: 12 adjectives, 12 nouns, and 12 verbs, balanced across both frequency bands and polysemy levels. We adopt the multi-label annotation scheme in the hope of lessening the drawbacks of discrete sense inventories and obtaining more realistic annotations from human experts. Sense-annotated data is collected through multiple annotation rounds to ensure high-quality annotations: with a 115 person-hours effort we reached an inter-annotator agreement score of 0.877. We analyze the obtained data and perform a correlation analysis between several relevant variables, including word frequency, number of senses, sense distribution skewness, average annotation time, and the observed inter-annotator agreement (IAA). Using the obtained data, we compile multiand single-labeled dataset variants using different label aggregation schemes. Finally, we evaluate three different baseline WSD models on both dataset variants and report on the insights gained. We make both dataset variants freely available.
منابع مشابه
Experiments on Active Learning for Croatian Word Sense Disambiguation
Supervised word sense disambiguation (WSD) has been shown to achieve state-ofthe-art results but at high annotation costs. Active learning can ameliorate that problem by allowing the model to dynamically choose the most informative word contexts for manual labeling. In this paper we investigate the use of active learning for Croatian WSD. We adopt a lexical sample approach and compile a corresp...
متن کاملUsing Domain Information for Word Sense Disambiguation
The major goal in ITC-irst's participation at SENSEVAL-2 was to test the role of domain information in word sense disambiguation. The underlying working hypothesis is that domain labels, such as MEDICINE, ARCHITECTURE and SPORT provide a natural way to establish semantic relations among word senses, which can be profitably used during the disambiguation process. For each task in which we partic...
متن کاملSemEval-2007 Task 05: Multilingual Chinese-English Lexical Sample
The Multilingual Chinese-English lexical sample task at SemEval-2007 provides a framework to evaluate Chinese word sense disambiguation and to promote research. This paper reports on the task preparation and the results of six participants.
متن کاملUse of Machine Readable Dictionaries for Word-Sense Disambiguation in SENSEVAL-2
CL Research's word-sense disambiguation (WSD) system is part of the DIMAP dictionary software, designed to use any full dictionary as the basis for unsupervised disambiguation. Official SENSEV AL-2 results were generated using WordNet, and separately using the New Oxford Dictionary of English (NODE). The disambiguation functionality exploits whatever information is made available by the lexical...
متن کاملClass-based collocations for Word Sense Disambiguation
This paper describes the NMSU-Pitt-UNCA word-sense disambiguation system participating in the Senseval-3 English lexical sample task. The focus of the work is on using semantic class-based collocations to augment traditional word-based collocations. Three separate sources of word relatedness are used for these collocations: 1) WordNet hypernym relations; 2) cluster-based word similarity classes...
متن کامل